Experimental design
Update Mode
We distinguish between two fundamentally different update mechanisms:
- Synchronous update, which defines a deterministic dynamical system: from any given state, the successor state is uniquely determined.
- Asynchronous update, which induces a stochastic process: at each time step, a randomly selected node is updated, leading to multiple possible successor states.
This distinction is critical, as asynchronous dynamics introduce temporal dependence and potential autocorrelation in trajectories. In particular, long residence times in attractors or local cycles may reduce the effective information content of sampled data. Consequently, naive dense sampling may lead to strongly correlated observations, while aggressive subsampling may destroy causal ordering information.
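To make the distinction concrete, the two update schemes can be sketched as follows. This is a minimal illustration with hypothetical two-node update rules, not our simulation code:

```python
import random

def sync_step(state, rules):
    """Synchronous update: every node applies its rule to the *current*
    state, so the successor state is uniquely determined."""
    return tuple(rule(state) for rule in rules)

def async_step(state, rules, rng=random):
    """Asynchronous update: a single randomly chosen node is updated,
    so several successor states are possible from the same state."""
    i = rng.randrange(len(rules))
    successor = list(state)
    successor[i] = rules[i](state)
    return tuple(successor)

# Toy 2-node network (illustrative rules only):
# node 0 copies node 1; node 1 negates node 0.
rules = [lambda s: s[1], lambda s: 1 - s[0]]

print(sync_step((0, 1), rules))   # deterministic successor: (1, 1)
print(async_step((0, 1), rules))  # one of the possible successors
```

Repeated calls to `async_step` from the same state yield different successors, which is exactly the source of the temporal dependence discussed above.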
Scoring Functions
We employ two scoring functions implemented in BNFinder2, which differ in how they trade off data fit against model complexity. Let \(G\) denote a candidate network structure and \(D\) the observed dataset.
Minimal Description Length (MDL)
The MDL score is defined as
\[
\mathrm{MDL}(G \mid D)
= - \log P(D \mid G, \hat{\theta})
+ \frac{1}{2} \, |\theta_G| \, \log |D|,
\]
where \(\hat{\theta}\) are maximum-likelihood parameters and \(|\theta_G|\) denotes the number of free parameters implied by the graph structure \(G\).
The first term rewards goodness of fit, while the second term penalizes model complexity. As a consequence, MDL favors simpler graphs when data are scarce and becomes less restrictive as sample size increases. This makes MDL particularly sensitive to undersampling and normalization effects.
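For a single binary node, the score above can be sketched directly from frequency counts. This is a simplified illustration of the formula, not BNFinder2's implementation; the function name and data layout are our own:

```python
import math
from collections import Counter

def mdl_score(child, parents):
    """Simplified per-node MDL: negative ML log-likelihood plus
    0.5 * (number of free parameters) * log(sample size).
    child: list of 0/1 values; parents: list of parent columns."""
    n = len(child)
    configs = list(zip(*parents)) if parents else [()] * n
    joint = Counter(zip(configs, child))       # (parent config, child value) counts
    marginal = Counter(configs)                # parent config counts
    # ML log-likelihood: each observation contributes log P(child | config).
    loglik = sum(cnt * math.log(cnt / marginal[cfg])
                 for (cfg, _), cnt in joint.items())
    n_params = 2 ** len(parents)  # one free parameter per parent configuration
    return -loglik + 0.5 * n_params * math.log(n)

# A perfectly predictive parent leaves only the complexity penalty:
print(mdl_score([0, 1, 0, 1], [[0, 1, 0, 1]]))
```

The penalty term grows with the number of parents, which is why MDL favors sparse graphs when data are scarce.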
Bayesian–Dirichlet equivalence (BDe)
The BDe score evaluates the marginal likelihood
\[
\mathrm{BDe}(G \mid D)
= - \log \int P(D \mid G, \theta) \, P(\theta \mid G) \, d\theta,
\]
assuming a Dirichlet prior over conditional probability tables. Under standard assumptions, this integral has a closed-form solution.
Unlike MDL, BDe incorporates prior beliefs and smooths parameter estimates, which can stabilize inference in low-data regimes but may also reduce sensitivity to subtle structural differences.
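The closed form can be sketched for a single binary node using log-gamma terms. This is a BDe-style sketch assuming a symmetric Dirichlet prior with pseudocount `alpha` per child value; it is not BNFinder2's exact implementation:

```python
import math
from collections import Counter

def bde_score(child, parents, alpha=1.0):
    """Negative log marginal likelihood of one binary node under a
    symmetric Dirichlet(alpha) prior on each conditional distribution."""
    n = len(child)
    configs = list(zip(*parents)) if parents else [()] * n
    joint = Counter(zip(configs, child))
    marginal = Counter(configs)
    log_ml = 0.0
    for cfg, n_j in marginal.items():
        # Prior mass for this parent configuration (2 child values).
        log_ml += math.lgamma(2 * alpha) - math.lgamma(2 * alpha + n_j)
        for v in (0, 1):
            n_jk = joint.get((cfg, v), 0)
            log_ml += math.lgamma(alpha + n_jk) - math.lgamma(alpha)
    return -log_ml

print(bde_score([1, 0, 1, 1], []))
```

Note how the prior pseudocounts smooth the estimates: unlike the ML fit inside MDL, an unobserved parent configuration never produces a degenerate probability.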
By comparing MDL and BDe, we assess whether observed reconstruction effects are driven primarily by data properties or by the inductive bias of the scoring function.
One caveat is that our implementation of these scoring functions is simplified relative to the one in BNFinder2: the absolute score values differ, but the ranking of candidate structures is preserved. This therefore does not affect the validity of our conclusions.
Evaluation Metrics
Reconstruction quality is evaluated using structure-based graph distance measures. Each metric captures a distinct notion of discrepancy between the true network \(G^\ast\) and the inferred network \(\hat{G}\).
Adjusted Hamming Distance (AHD)
Let \(A^\ast\) and \(\hat{A}\) denote the adjacency matrices of \(G^\ast\) and \(\hat{G}\). The adjusted Hamming distance is defined as
\[
\mathrm{AHD}(G^\ast, \hat{G})
= \frac{1}{|E^\ast| + |\hat{E}|}
\sum_{i,j} \mathbf{1}_{\{A^\ast_{ij} \neq \hat{A}_{ij}\}}.
\]
AHD measures the proportion of mismatched edges, normalized by the total number of edges in both graphs, which allows comparisons across networks of different sizes. This metric serves as the primary measure of structural accuracy.
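A direct implementation of this definition from 0/1 adjacency matrices (given as nested lists) is a few lines; the convention for two empty graphs returning 0 is our choice:

```python
def ahd(A_true, A_hat):
    """Adjusted Hamming distance: mismatched adjacency entries divided
    by the total number of edges in both graphs."""
    n = len(A_true)
    mismatches = sum(A_true[i][j] != A_hat[i][j]
                     for i in range(n) for j in range(n))
    total_edges = sum(map(sum, A_true)) + sum(map(sum, A_hat))
    return mismatches / total_edges if total_edges else 0.0

# A single reversed edge mismatches two entries over two total edges:
print(ahd([[0, 1], [0, 0]], [[0, 0], [1, 0]]))  # 1.0
```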
Structural Hamming Distance (SHD)
SHD counts the minimum number of edge insertions, deletions, or reversals required to transform \(\hat{G}\) into \(G^\ast\). While widely used, SHD aggregates heterogeneous error types and does not distinguish between missing, extra, or misoriented edges, limiting its interpretability.
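A minimal sketch, comparing each unordered node pair once so that a reversed edge counts as a single operation (one common SHD convention; self-loops are ignored):

```python
def shd(A_true, A_hat):
    """Structural Hamming distance over 0/1 adjacency matrices:
    each node pair whose (forward, backward) edge status differs
    contributes one operation (insertion, deletion, or reversal)."""
    n = len(A_true)
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            true_pair = (A_true[i][j], A_true[j][i])
            hat_pair = (A_hat[i][j], A_hat[j][i])
            if true_pair != hat_pair:
                distance += 1
    return distance

# A reversed edge is one operation, not two:
print(shd([[0, 1], [0, 0]], [[0, 0], [1, 0]]))  # 1
```

Note the contrast with AHD above: the same reversed edge scores 1 under this convention but mismatches two adjacency entries.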
Structural Intervention Distance (SID)
SID measures the number of node pairs \((i,j)\) for which the causal effect of intervening on \(i\) differs between \(G^\ast\) and \(\hat{G}\).
\[
\mathrm{SID}(G^\ast, \hat{G})
= \left| \left\{ (i,j) : P(j \mid \mathrm{do}(i))_{G^\ast}
\neq P(j \mid \mathrm{do}(i))_{\hat{G}} \right\} \right|,
\]
where \(\mathrm{do}(i = n)\) sets node \(i\) to the value \(n\) at a given time step, regardless of its Boolean update rule.
SID is sensitive to edge orientation and captures errors that affect causal interpretability, even when the undirected structure is largely correct.
Using these metrics jointly allows us to separate purely topological accuracy (AHD, SHD) from the correctness of implied causal relationships (SID) and to identify metric-specific effects of sampling and model selection.
Independence
To avoid introducing structural bias, Boolean networks are generated randomly for each condition:
- each node is assigned a random number of parents (uniformly chosen from \(\{1, 2, 3\}\)),
- Boolean update functions are sampled randomly.

All generated networks are accepted without filtering. Repetitions under identical conditions are treated as independent realizations.
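The generation procedure above can be sketched as follows. The truth-table representation, the use of Python's `random` module, and the decision to allow self-regulation are our illustrative choices:

```python
import random

def random_boolean_network(n_nodes, rng=random):
    """Generate one unfiltered random Boolean network: each node gets
    1-3 parents (uniform) and a random truth table over their states.
    Parents are drawn without replacement; self-loops may occur."""
    network = []
    for _ in range(n_nodes):
        k = rng.choice([1, 2, 3])
        parents = rng.sample(range(n_nodes), k)
        table = [rng.randint(0, 1) for _ in range(2 ** k)]  # random update rule
        network.append((parents, table))
    return network

def apply_rule(parents, table, state):
    """Look up a node's next value from its parents' current values."""
    idx = sum(state[p] << b for b, p in enumerate(parents))
    return table[idx]

net = random_boolean_network(6)
print([len(parents) for parents, _ in net])  # 1-3 parents per node
```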
Sample Size Normalization
To ensure comparability across networks of different sizes, we introduce sample-size normalization. Let:
- \(n_{\text{nodes}}\) denote the number of nodes,
- \(T_{\text{length}}\) the trajectory length,
- \(k\) a normalization constant.
The number of sampled trajectories is defined as
\[
T_{\text{amount}} = \frac{n_{\text{nodes}} \cdot k}{T_{\text{length}}},
\]
so that the total number of sampled time points, \(T_{\text{amount}} \cdot T_{\text{length}} = n_{\text{nodes}} \cdot k\), is held fixed across network sizes.
We fix \(k=100\). This choice is motivated by the fact that each node with at most three parents has up to \(2^3 = 8\) possible parent-state configurations. Setting \(k=100\) corresponds to approximately 10 observations per configuration per node, providing a conservative buffer against stochastic effects and subsampling losses.
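In code, the normalization reduces to one line; rounding to an integer trajectory count (and the floor of one trajectory) is an assumption on our part:

```python
def n_trajectories(n_nodes, t_length, k=100):
    """T_amount: number of trajectories such that roughly n_nodes * k
    time points are sampled in total, i.e. about k per node."""
    return max(1, round(n_nodes * k / t_length))

# e.g. a 10-node network observed in trajectories of length 20:
print(n_trajectories(10, 20))  # 50 trajectories -> 1000 time points total
```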